Topology-oblivious optimization of MPI broadcast algorithms on extreme-scale platforms

Authors

  • Khalid Hasanov
  • Jean-Noël Quintin
  • Alexey L. Lastovetsky
Abstract

Keywords: MPI, Broadcast, BlueGene, Grid'5000, Extreme-scale, Communication, Hierarchy

Significant research has been conducted on collective communication operations, in particular MPI broadcast, on distributed-memory platforms. Most of these efforts aim to optimize the collective operations for particular architectures by taking into account either their topology or platform parameters. In this work we propose a simple but general approach to optimizing the legacy MPI broadcast algorithms that are widely used in MPICH and Open MPI. The proposed optimization technique is designed to address the challenge of the extreme scale of future HPC platforms. It is based on a hierarchical transformation of the traditionally flat logical arrangement of communicating processors. Theoretical analysis and experimental results on IBM BlueGene/P and a cluster of the Grid'5000 platform are presented.

The Message Passing Interface (MPI) [1] is one of the core building blocks of scientific software libraries for parallel applications. For example, the PETSc library [2], which is used for simulation and modeling in application domains such as nanosimulations, aerodynamics, geosciences and computational fluid dynamics, relies on MPI for all inter-process data communication. Depending on the application, MPI collective communication operations can provide significant performance improvements over MPI point-to-point communication routines. One of the commonly used collective operations, MPI broadcast, appears in a variety of basic scientific kernels such as parallel matrix–matrix multiplication and LU factorization. During a broadcast operation, the root process sends a message to all other processes in the specified group of processes.
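The hierarchical transformation mentioned in the abstract can be illustrated with a small simulation. The sketch below is only one plausible reading of that idea, not the authors' implementation: it assumes rank 0 is the root, that ranks are split into contiguous groups, and that the first rank of each group acts as its leader. The function names `two_level_groups` and `hierarchical_broadcast` are hypothetical, and a plain linear send is used at both levels purely for simplicity.

```python
def two_level_groups(num_procs, num_groups):
    """Split ranks 0..num_procs-1 into contiguous groups of near-equal size.

    Assumption: the first rank in each group is treated as the group leader.
    """
    if not 1 <= num_groups <= num_procs:
        raise ValueError("num_groups must be between 1 and num_procs")
    base, extra = divmod(num_procs, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def hierarchical_broadcast(num_procs, num_groups, message):
    """Simulate a two-level broadcast rooted at rank 0.

    Returns (received, sends): a dict mapping rank -> message and the
    list of (source, destination) sends that were performed.
    """
    groups = two_level_groups(num_procs, num_groups)
    received = {0: message}
    sends = []

    # Phase 1: the root sends the message to the leader of every other group.
    for group in groups[1:]:
        leader = group[0]
        received[leader] = received[0]
        sends.append((0, leader))

    # Phase 2: each leader forwards the message to the members of its group.
    for group in groups:
        leader = group[0]
        for rank in group[1:]:
            received[rank] = received[leader]
            sends.append((leader, rank))

    return received, sends


if __name__ == "__main__":
    received, sends = hierarchical_broadcast(16, 4, "payload")
    root_sends = sum(1 for src, _ in sends if src == 0)
    print(len(received), len(sends), root_sends)
```

With 16 processes in 4 groups, the total number of sends is unchanged (15), but the root issues only 6 of them instead of all 15 as in a flat linear broadcast, and the four intra-group phases are independent of one another.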
The implementations of the broadcast operation in MPICH [3] and Open MPI [4] are typically based on linear, binary, binomial and pipelined algorithms [5]. Linear algorithms scale poorly to large numbers of processes, while the binary and binomial algorithms are inefficient for large messages. Pipelined algorithms, by contrast, are more efficient for both larger numbers of processes and larger data sizes. Other widely used broadcast algorithms are scatter-ring-allgather and scatter-recursive-doubling-allgather [6], both of which have been implemented in MPICH.
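To make the scaling difference between the linear and binomial algorithms concrete, the following sketch generates the send schedule for each. It is a simplified simulation under stated assumptions (root is rank 0, every send takes one step, message size is ignored), not code taken from MPICH or Open MPI. The function names are hypothetical, and the binomial schedule shown is one common formulation in which the set of ranks holding the message doubles every round.

```python
from math import ceil, log2


def linear_broadcast_steps(num_procs):
    """Linear broadcast: the root sends to every other rank, one send per step."""
    return [[(0, dst)] for dst in range(1, num_procs)]


def binomial_broadcast_rounds(num_procs):
    """Binomial-tree broadcast rooted at rank 0.

    Returns a list of rounds; each round is a list of (source, destination)
    sends that can proceed in parallel. Before each round, ranks 0..have-1
    already hold the message, and each of them sends to rank + have.
    """
    rounds = []
    have = 1
    while have < num_procs:
        sends = [(src, src + have) for src in range(have) if src + have < num_procs]
        rounds.append(sends)
        have *= 2
    return rounds


if __name__ == "__main__":
    for p in (8, 10, 1024):
        linear = len(linear_broadcast_steps(p))
        binomial = len(binomial_broadcast_rounds(p))
        print(f"p={p}: linear steps={linear}, binomial rounds={binomial}, "
              f"ceil(log2 p)={ceil(log2(p))}")
```

The linear schedule needs p - 1 sequential steps, whereas the binomial schedule finishes in ceil(log2 p) rounds. Because every round forwards the whole message, this simple model does not capture the large-message cost that pipelined algorithms address by splitting the message into segments.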

Similar resources

High-Level Topology-Oblivious Optimization of MPI Broadcast Algorithms on Extreme-Scale Platforms

There has been a significant research in collective communication operations, in particular in MPI broadcast, on distributed memory platforms. Most of the research works are done to optimize the collective operations for particular architectures by taking into account either their topology or platform parameters. In this work we propose a very simple and at the same time general approach to opt...

Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms

Multicore Clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issue exposed by deep memory hierarchies by carefully considering the mapping between the collective...

Locality and Topology Aware Intra-node Communication among Multicore CPUs

A major trend in HPC is the escalation toward manycore, where systems are composed of shared memory nodes featuring numerous processing units. Unfortunately, with scale comes complexity, here in the form of non-uniform memory accesses and cache hierarchies. For most HPC applications, harnessing the power of multicores is hindered by the topology oblivious tuning of the MPI library. In this pape...

Collective Framework and Performance Optimizations to Open MPI for Cray XT Platforms

The performance and scalability of collective operations plays a key role in the performance and scalability of many scientific applications. Within the Open MPI code base we have developed a general purpose hierarchical collective operations framework called Cheetah, and applied it at large scale on the Oak Ridge Leadership Computing Facility’s Jaguar (OLCF) platform, obtaining better performa...

Broadcast Routing in Wireless Ad-Hoc Networks: A Particle Swarm optimization Approach

While routing in multi-hop packet radio networks (static Ad-hoc wireless networks), it is crucial to minimize power consumption since nodes are powered by batteries of limited capacity and it is expensive to recharge the device. This paper studies the problem of broadcast routing in radio networks. Given a network with an identified source node, any broadcast routing is considered as a directed...

Journal title:
  • Simulation Modelling Practice and Theory

Volume 58

Publication date: 2015